{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用PaddleX调用Resnet18根据数据集[年龄、性别和种族（人脸数据）](https://aistudio.baidu.com/aistudio/datasetdetail/106996)进行性别分类\n",
    "## 背景\n",
    "人脸是人类生活中暴露程度较高的部位，并且人脸能够较为明显地反馈出人的年龄，性别信息。对人脸信息进行识别，在特定场景下对人脸信息进行识别有助于提高生活的便捷程度，促进社会和谐，降低恶性事件的发生概率。如在厕所入口处进行人脸识别，可以在异性进入厕所时进行告警，从而避免女厕所的偷拍行为的出现、避免走错厕所的尴尬情形。\n",
    "\n",
    "## 数据集介绍\n",
    "[年龄、性别和种族（人脸数据）](https://aistudio.baidu.com/aistudio/datasetdetail/106996)包含了三个种族，两个性别，年龄从0到99分布的23705个人脸样本。每个人脸样本都通过一组数组进行存储，并且可以转化为48×48的灰度图像。\n",
    "\n",
    "\n",
    "部分图片示例\n",
    "\n",
    "| 0| 9999 | 11007|23033 |\n",
    "| -------- | -------- | -------- |-------- |\n",
    "| ![](https://ai-studio-static-online.cdn.bcebos.com/a23613fc17ac427aaf18145025641eeb279b2ac62dfc44e09f82d34c1c4549b7)     | ![](https://ai-studio-static-online.cdn.bcebos.com/694d7ec974f149c49026926160e92a5755b8d44491e84791820a226388b30dfc)     | ![](https://ai-studio-static-online.cdn.bcebos.com/b9885646d8d140fb98ebe80acfbb1dda5816d4c946784278b35f714ee7712364)     |![](https://ai-studio-static-online.cdn.bcebos.com/582d89735d8646c8b2f5e03cd088172ef5e358e453a640378f80178b96059ac7)   |\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## 方案介绍\n",
    "\n",
    "首先将数组转化为图相，调用PaddleX中的Resnet18进行性别分类。\n",
    "\n",
    "# 代码\n",
    "\n",
    "## 数据预处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-23T02:38:02.947683Z",
     "iopub.status.busy": "2022-02-23T02:38:02.947181Z",
     "iopub.status.idle": "2022-02-23T02:38:03.355885Z",
     "shell.execute_reply": "2022-02-23T02:38:03.355131Z",
     "shell.execute_reply.started": "2022-02-23T02:38:02.947645Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from PIL import Image\n",
    "import random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-23T02:38:08.204800Z",
     "iopub.status.busy": "2022-02-23T02:38:08.204207Z",
     "iopub.status.idle": "2022-02-23T02:38:40.328161Z",
     "shell.execute_reply": "2022-02-23T02:38:40.327344Z",
     "shell.execute_reply.started": "2022-02-23T02:38:08.204732Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 生成图像和对应标签\n",
    "df=pd.read_csv('data/data106996/age_gender.csv')\n",
    "\n",
    "! rm -r -f img\n",
    "! mkdir img\n",
    "\n",
    "def pixels2image(pixels): \n",
    "    return np.array(pixels.split()).astype(np.float).reshape(48, 48)\n",
    "\n",
    "with open('gender_label.txt','w') as f:\n",
    "    for i in range(len(df)):\n",
    "        pixels=df['pixels'][i]\n",
    "        img = pixels2image(pixels)\n",
    "        img=Image.fromarray(img)\n",
    "        img.convert('L').save('img/'+str(i)+'.jpg')\n",
    "        f.write('img/'+str(i)+'.jpg\\t'+str(df['gender'][i])+'\\n')\n",
    "\n",
    "with open('gender_name.txt','w') as f:\n",
    "    f.write('男\\n女')\n",
    "\n",
    "with open('gender_label.txt','r') as f:\n",
    "    lines=f.readlines()\n",
    "    random.shuffle(lines)\n",
    "    train_list=lines[:int(len(lines)*0.8)]\n",
    "    eval_list=lines[int(len(lines)*0.8):]\n",
    "\n",
    "with open('train_label.txt','w') as f:\n",
    "    for item in train_list:\n",
    "        f.write(item)\n",
    "\n",
    "with open('eval_label.txt','w') as f:\n",
    "    for item in eval_list:\n",
    "        f.write(item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用PaddleX进行分类任务训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 安装paddlex\n",
    "!pip install paddlex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 导入包\n",
    "import os\n",
    "import random\n",
    "import numpy as np\n",
    "from random import shuffle\n",
    "import cv2\n",
    "import paddle\n",
    "from PIL import Image\n",
    "import shutil\n",
    "import re\n",
    "from paddle.vision.transforms import functional as F\n",
    "import os.path\n",
    "from PIL import Image\n",
    "import paddlex as pdx\n",
    "import sys\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-18T13:57:30.396978Z",
     "iopub.status.busy": "2022-02-18T13:57:30.396300Z",
     "iopub.status.idle": "2022-02-18T13:57:30.629463Z",
     "shell.execute_reply": "2022-02-18T13:57:30.628872Z",
     "shell.execute_reply.started": "2022-02-18T13:57:30.396940Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-02-18 21:57:30 [INFO]\tStarting to read file list from dataset...\n",
      "2022-02-18 21:57:30 [INFO]\t18964 samples in file train_label.txt\n",
      "2022-02-18 21:57:30 [INFO]\tStarting to read file list from dataset...\n",
      "2022-02-18 21:57:30 [INFO]\t4741 samples in file eval_label.txt\n"
     ]
    }
   ],
   "source": [
    "train_transforms = pdx.transforms.Compose([\n",
    "        pdx.transforms.RandomHorizontalFlip(), \n",
    "        pdx.transforms.Normalize()])\n",
    "        \n",
    "eval_transforms = pdx.transforms.Compose([\n",
    "        pdx.transforms.Resize([48,48]),\n",
    "        pdx.transforms.Normalize()\n",
    "    ])\n",
    "\n",
    "train_dataset = pdx.datasets.ImageNet(\n",
    "        data_dir='/home/aistudio',\n",
    "        file_list='train_label.txt',\n",
    "        label_list='gender_name.txt',\n",
    "        transforms=train_transforms,\n",
    "        shuffle=True)\n",
    "\n",
    "eval_dataset = pdx.datasets.ImageNet(\n",
    "        data_dir='/home/aistudio',\n",
    "        file_list='eval_label.txt',\n",
    "        label_list='gender_name.txt',\n",
    "        transforms=eval_transforms)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-18T13:57:34.675564Z",
     "iopub.status.busy": "2022-02-18T13:57:34.674859Z",
     "iopub.status.idle": "2022-02-18T13:57:34.722394Z",
     "shell.execute_reply": "2022-02-18T13:57:34.721724Z",
     "shell.execute_reply.started": "2022-02-18T13:57:34.675527Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 模型生成\n",
    "num_classes = len(train_dataset.labels)\n",
    "model = pdx.cls.ResNet18(num_classes=num_classes)\n",
    "# model = pdx.load_model('paddlex/restnet200_vd/0.94167')\n",
    "num_epochs=50"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "model.train(num_epochs=num_epochs,\n",
    "                train_dataset=train_dataset,\n",
    "                train_batch_size=1600,\n",
    "                eval_dataset=eval_dataset,\n",
    "                lr_decay_epochs=[int(num_epochs*0.4), int(num_epochs*0.7), int(num_epochs*0.9)],\n",
    "                save_interval_epochs=int(num_epochs/5),\n",
    "                learning_rate=0.00025,\n",
    "                save_dir='cls/model',\n",
    "                use_vdl=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 测试模型效果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-18T14:01:14.913894Z",
     "iopub.status.busy": "2022-02-18T14:01:14.913397Z",
     "iopub.status.idle": "2022-02-18T14:01:15.058528Z",
     "shell.execute_reply": "2022-02-18T14:01:15.057848Z",
     "shell.execute_reply.started": "2022-02-18T14:01:14.913846Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-02-18 22:01:15 [INFO]\tModel[ResNet18] loaded.\n"
     ]
    }
   ],
   "source": [
    "# 预测\n",
    "num_classes = len(train_dataset.labels)\n",
    "model = pdx.load_model('cls/model/best_model')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-18T14:06:37.650517Z",
     "iopub.status.busy": "2022-02-18T14:06:37.649847Z",
     "iopub.status.idle": "2022-02-18T14:07:09.466585Z",
     "shell.execute_reply": "2022-02-18T14:07:09.465881Z",
     "shell.execute_reply.started": "2022-02-18T14:06:37.650480Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "acc is 0.757646066230753\n"
     ]
    }
   ],
   "source": [
    "T_count=0\n",
    "F_count=0\n",
    "with open('eval_label.txt','r') as f:\n",
    "    for line in f.readlines():\n",
    "        img_dir = line.split('\\t')[0]\n",
    "        label = int(line.split('\\t')[1][0])\n",
    "        pre = model.predict(img_dir)[0]['category_id']\n",
    "        if pre==label:\n",
    "            T_count+=1\n",
    "        else:\n",
    "            F_count+=1\n",
    "print(f'acc is {T_count/(T_count+F_count)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 可视化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-18T14:11:54.941876Z",
     "iopub.status.busy": "2022-02-18T14:11:54.940975Z",
     "iopub.status.idle": "2022-02-18T14:11:55.099139Z",
     "shell.execute_reply": "2022-02-18T14:11:55.098564Z",
     "shell.execute_reply.started": "2022-02-18T14:11:54.941840Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-02-18 22:11:55 [INFO]\tModel[ResNet18] loaded.\n",
      "{'category_id': 0, 'category': '男', 'score': 0.9179523}\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAADAAAAAwCAAAAAByaaZbAAAGM0lEQVR4nCXVuZJcVxkA4H87d+vu6e5ZJdnGxmUQFBQk5BRVPCwPQEYMIUUCJGBhI3k0o5np9W7n/AsB30t8+CNffPn285RZKmFmFpKqrhJHntVMFcNUA8n73dP9x6MKErqWlIRAjYQbRC+hxGrm7gEBSICkqg5gRZDIxrGpGBEp1UmACCACkAOCwsKDAMHNAjGUgLAMg1IAsQizByAREyAxIyIhIGK4mTmiFwkIHccCxEzoWZfERKGACBQM5IYO4T7Nc845ZwHAMFVLkoSJqWJhdHNmBmWLOZwwAsZpnsdhyoKIRBAOrobEXKo2JXBnJjMrXlSBwKyYay4GgkREBG6l5NkAN90GOAk4hY6nSUWNGLQAMkIgS1mOzWft+U53OT+dCJebq+12k2xuj//dnU/adinqTccVrZ4PVGVJF5VAW077c6RVl8c8VKnNYXo4TMPLVPVcVzw7T1Y3ggFCFSI2cZxXbZ1Yp14SzGc2H8oyrdJ2R+zDWVoqRmNJW0mzLgTB8mXX70pIT1Jn23RlMM+gT6+zhc8zdNlhWt29ktspXzdqqcHn97tTqX/NHL1cdMeYz2ea7+9hebOU3jSwqb761U/km2912QS1C9idO+mrLyjGsWSfJ3Vc1E/voD3dLS/gRNymq8tGvnjJiZgbz756jdP6QOIIALVTM5usbqfhxLKtZnPAhT/Isk3hLSVsbu2V++fvAT1DPkPBNch63ZbjgRM5RzHZdBNNvymdkV4s2s9eq8YHYiFAqX3h85KOMOryyx9vKbqN8Taf3krLy0hRYQPgjsQ5bLIgAKC6AxcKJAI3H1K/wC3LVbl1sS6QiQCJWp2VUYgB0kqMOALCjfBj45v2rqKrxZ15ZjMHANeC6BYsTGDBSSAiPMKdXVG6VIRgdT8wugcQeAS4FidysFICITQCwSNKHFSbI8q5dGlcGBQgIgIgAg8MDTcPTFIiACCipOLrlYJkWW8jqQMQMiJGuAOCEkI4ELKbe5gat+cvL8CJus2VVFaICAAAiqoFAjBTmHkIE7jmXDQNt9IITZbsphxTOJ6WXEWEBs152ay78fFpKNF26ARj0Y91K1c/yL9+9vnrP99eKKFDnssUOk9BUOPuCFXuxxsLwrCSfTWRv38nf/z+tze9UNZK7FhwEvcylzGqw9SkWuClYXX3iEOZh/Of/iB/f/f4u1WrtWmBE6ZASWnOfoBeAamWfVCEFYuW//rv8m6Upv/L0y8LV8aAu6pRZUMmywdFeikXCzSL0KKKun/M44XMCf721muTsHhsai2VarhNjphfTqdVAa5r4ArK7mb9/N1RnNqpzFVOiNhbLl6CUpVjY+46+zRwu64khS3w8mnoRJbnsTq8yKLCiqsk1iWQehpTV4rRXIxZ6kTFSv/papM/PErfDekU55uxifzNIScLJroW0FzKFEb01YebtDeW/robOkYBhfKw+axfBdfSKP5znkOqJI0WVVquNquLRSkRp16hnEslrKIPzeYJQiiJx88PPQhKfeKkmq4uO2+h72V6GVlP+1lEpjqmp/xYBUdi9yrVVQqqL8xMoasty/lotHsusH88mwsAckz+spEGrUJvcyAgU+QChDqwyWEmffqEdLjPmKWAWTKOx+BcdTWOZTwVStKNJXUN0ayx53y8n6dptycGCQ6rrE3fI9VptZJuxTEgTBR121ZVGvfz/vLTDy833z7M7iGCtQFGU/1jGJVJ6tiso4p5xna5YLWg/jxcHZ/81buB0+gVsc/r7KdXX3/6j0WvQpNvbpfL1zUJFaBp9BN9+MF+un22yaFBkcqN5XJaP58/XjRNtfJCkSAaZy+AZXA6jXSzUo8IVxdlzPT1m+Fy/92DdbzsAAlFQmbUyXE69tPjfnvdqgUihouWFPL7653cvhxPY1xR1xAQIyKEFxum5/051xc8WTAzg1STYPu771O+fBPTfl5tLpmdgSyYK89T/3Cqb69bQkSA/0dPfH137LG5u0t5/+2hzxpEkEsQ+dx/eB7bNxvk9bJh8AjJiyRvy2Js8hqchpeVl3ax9KKa5XR8fnxIq+1NHfL6FQyOEWJraH5hXxjk+qpUL+/vh8PqymiUUtFx9+lB32y2ayr1q5sJZ0L8Hx60UcUvhXQ1AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<PIL.JpegImagePlugin.JpegImageFile image mode=L size=48x48 at 0x7F543825E5D0>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "num_classes = len(train_dataset.labels)\n",
    "model = pdx.load_model('cls/model/best_model')\n",
    "print(model.predict('img/11046.jpg')[0])\n",
    "from PIL import Image \n",
    "img = Image.open('img/11046.jpg')\n",
    "img"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 总结\n",
    "本项目使用PaddleX极简的构造了一个人脸分类模型，最终的分类结果精度仅有75%可能是因为包含了大量的幼年（婴儿）人脸，从而难以辨别，对学习器学习带来了一定的压力。可以考虑在做分类时将这部分样本剔除。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 个人介绍\n",
    "\n",
    "> 作者：笠雨聆月\n",
    "\n",
    "> 兴趣：目前从最容易上手的cv进行学习，也在尝试nlp，gan等等，各种方向来者不拒\n",
    "\n",
    "> 个人主页：https://aistudio.baidu.com/aistudio/personalcenter/thirdview/608082"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "请点击[此处](https://ai.baidu.com/docs#/AIStudio_Project_Notebook/a38e5576)查看本环境基本用法.  <br>\n",
    "Please click [here ](https://ai.baidu.com/docs#/AIStudio_Project_Notebook/a38e5576) for more detailed instructions. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "py35-paddle1.2.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
